Text Mining applied to Molecular Biology
This thesis describes the development of text-mining
algorithms for molecular biology, in particular for DNA microarray
data analysis. Concept profiles were introduced, which characterize
the context in which a gene is mentioned in literature, to retrieve
functional associations between genes. The method was shown to
efficiently annotate DNA microarray data and complement existing
methods. Concept profiles were also used for other types of concepts
and were successfully applied for functional annotation of genes
through automatic assignment of Gene Ontology terms to genes. A
generic framework based on concept profiles, dubbed Anni
(www.biosemantics.org/anni), was developed to provide researchers with
an ontology-based interface to the literature, and its utility for
literature-based knowledge discovery was demonstrated. The thesis
covers the use and development of text-mining tools to identify
relations between genes and to automatically annotate sets of genes
resulting from microarray experiments.
Comparing DNA microarray studies can reveal interesting parallels.
However, such analyses are hampered by the large influence of design,
technical, and statistical factors on which genes are found to be
differentially expressed. Comparisons based on perturbed biological processes
could be more robust. Concept profiles were used to reveal overlapping
biological processes between microarray studies in a comparative meta-
analysis of 102 muscle-related microarray studies. We demonstrated
that many more biologically meaningful links could be retrieved
between studies, even between studies without differentially expressed
genes in common.
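The idea that two studies can share perturbed processes without sharing any differentially expressed gene can be illustrated with a minimal sketch. The abstract does not state how overlap was scored, so the Jaccard index used here is an assumption, and the gene and process names are hypothetical toy values, not data from the meta-analysis.

```python
def jaccard(a, b):
    """Jaccard index of two collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def processes_for(genes, gene_to_processes):
    """Collect every process linked to at least one gene in the list."""
    found = set()
    for gene in genes:
        found |= set(gene_to_processes.get(gene, ()))
    return found


# Hypothetical toy annotation: which processes each gene is linked to.
gene_to_processes = {
    "g1": {"p_a"},
    "g2": {"p_b"},
    "g3": {"p_a"},
    "g4": {"p_c"},
}

study_1 = ["g1", "g2"]  # toy list of differentially expressed genes
study_2 = ["g3", "g4"]  # shares no gene with study_1

gene_overlap = jaccard(study_1, study_2)
process_overlap = jaccard(
    processes_for(study_1, gene_to_processes),
    processes_for(study_2, gene_to_processes),
)
```

In this toy case the gene-level overlap is zero while the process-level overlap is not, which is the kind of link a gene-by-gene comparison would miss.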
Mining microarray datasets aided by knowledge stored in literature
DNA microarray technology produces large amounts of data. For data mining
of these datasets, background information on genes can be helpful.
Unfortunately, most information is stored in free text. Here, we present an
approach to use this information for DNA microarray data mining.
Ambiguity of human gene symbols in LocusLink and MEDLINE: creating an inventory and a disambiguation test collection
Genes are discovered almost on a daily basis and new names have to be
found. Although there are guidelines for gene nomenclature, the naming
process is highly creative. Human genes are often named with a gene symbol
and a longer, more descriptive term; the short form is very often an
abbreviation of the long form. Abbreviations in biomedical language are
highly ambiguous, i.e., one gene symbol often refers to more than one
gene. Using an existing abbreviation expansion algorithm, we explore MEDLINE
for the use of human gene symbols derived from LocusLink. It turns out
that just over 40% of these symbols occur in MEDLINE; however, many of
these occurrences are not related to genes. In the process of making the
inventory, a disambiguation test collection is constructed automatically.
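The first step of such an inventory, finding symbols that point to more than one gene, can be sketched as follows. This is only an illustration of the ambiguity problem: the record layout and the identifiers are hypothetical, and it does not reproduce the abbreviation expansion algorithm or the LocusLink format used in the study.

```python
from collections import defaultdict


def ambiguous_symbols(gene_to_symbols):
    """Return {symbol: sorted gene ids} for symbols shared by several genes."""
    symbol_to_genes = defaultdict(set)
    for gene_id, symbols in gene_to_symbols.items():
        for symbol in symbols:
            symbol_to_genes[symbol].add(gene_id)
    return {
        symbol: sorted(genes)
        for symbol, genes in symbol_to_genes.items()
        if len(genes) > 1
    }


# Hypothetical toy records: gene id -> symbols and aliases.
toy_records = {
    "gene_1": ["SYM1", "ALIAS"],
    "gene_2": ["SYM2", "ALIAS"],
    "gene_3": ["SYM3"],
}

inventory = ambiguous_symbols(toy_records)
```

Symbols that also occur in MEDLINE with a non-gene meaning would need a separate check against the expanded long forms, which this sketch leaves out.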
Co-occurrence based meta-analysis of scientific texts: retrieving biological relationships between genes
MOTIVATION: The advent of high-throughput experiments in molecular biology creates a need for methods to efficiently extract and use information for large numbers of genes. Recently, the associative concept space (ACS) has been developed for the representation of information extracted from biomedical literature. The ACS is a Euclidean space in which thesaurus concepts are positioned, and the distances between concepts indicate their relatedness. The ACS uses co-occurrence of concepts as a source of information. In this paper, we evaluate how well the system can retrieve functionally related genes, and we compare its performance with a simple gene co-occurrence method. RESULTS: To assess the performance of the ACS, we composed a test set of five groups of functionally related genes. With the ACS, good scores were obtained for four of the five groups. When compared to the gene co-occurrence method, the ACS is capable of revealing more functional biological relations and can achieve results with less literature available per gene. Hierarchical clustering was performed on the ACS output, as a potential aid to users, and was found to provide useful clusters. Our results suggest that the algorithm can be of value for researchers studying large numbers of genes. AVAILABILITY: The ACS program is available upon request from the authors.
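The two notions of relatedness compared in this abstract can be sketched in a few lines: a simple gene co-occurrence count over documents, and a Euclidean distance between positioned concepts. The sketch does not implement the ACS positioning algorithm, which the abstract does not describe; the coordinates and gene names are hypothetical toy values.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count, for each gene pair, the documents that mention both genes."""
    counts = Counter()
    for genes in docs:
        # Deduplicate within a document and sort so each pair has one key.
        for pair in combinations(sorted(set(genes)), 2):
            counts[pair] += 1
    return counts


def concept_distance(p, q):
    """Euclidean distance between two concept positions."""
    return math.dist(p, q)


# Hypothetical toy corpus: each document is the list of genes it mentions.
docs = [["g1", "g2"], ["g1", "g2", "g3"], ["g3"]]
counts = cooccurrence_counts(docs)

# Hypothetical toy coordinates; in the ACS a smaller distance means
# the two concepts are more closely related.
positions = {"g1": (0.0, 0.0), "g2": (3.0, 4.0)}
distance = concept_distance(positions["g1"], positions["g2"])
```

A co-occurrence count is zero for any two genes that never share a document, whereas a distance in a shared space is defined for every pair, which is one way to read the claim that the ACS needs less literature per gene.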
Using contextual queries
Search engines generally treat search requests in isolation. The results
for a given query are identical, independent of the user or the context
in which the user made the request. An approach is demonstrated that
explores implicit contexts as obtained from a document the user is
reading. The approach inserts into an original (web) document
functionality to directly activate context-driven queries that yield
related articles obtained from various information sources.
Text-derived concept profiles support assessment of DNA microarray data for acute myeloid leukemia and for androgen receptor stimulation
BACKGROUND: High-throughput experiments, such as with DNA microarrays, typically result in hundreds of genes potentially relevant to the process under study, rendering the interpretation of these experiments problematic. Here, we propose and evaluate an approach to find functional associations between large numbers of genes and other biomedical concepts from free-text literature. For each gene, a profile of related concepts is constructed that summarizes the context in which the gene is mentioned in literature. We assign a weight to each concept in the profile based on a likelihood ratio measure. Gene concept profiles can then be clustered to find related genes and other concepts. RESULTS: The experimental validation was done in two steps. We first applied our method to a controlled test set. After this proved to be successful, the datasets from two DNA microarray experiments were analyzed in the same way and the results were evaluated by domain experts. The first dataset was a gene-expression profile that characterizes the cancer cells of a group of acute myeloid leukemia patients. For this group of patients, the biological background of the cancer cells is largely unknown. Using our methodology, we found an association of these cells with monocytes, which agreed with other experimental evidence. The second dataset consisted of differentially expressed genes following androgen receptor stimulation in a prostate cancer cell line. Based on the analysis, we put forward a hypothesis about the biological processes induced in these studied cells: secretory lysosomes are involved in the production of prostatic fluid, and their development and/or secretion are androgen-regulated processes. CONCLUSION: Our method can be used to analyze DNA microarray datasets based on information explicitly and implicitly available in the literature. We provide a publicly available tool, dubbed Anni, for this purpose.
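A concept profile with likelihood-ratio weights can be sketched as below. The abstract does not specify which likelihood ratio measure was used, so the log-likelihood ratio statistic (G-squared) on a 2x2 document contingency table is an assumption made for illustration, and the gene and concept names are hypothetical toy values.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G-squared statistic for a 2x2 table of document counts.

    k11: documents with gene and concept, k12: gene only,
    k21: concept only, k22: neither.
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total


def concept_profile(gene, docs):
    """Weight every concept that co-occurs with `gene` in at least one document."""
    gene_docs = [d for d in docs if gene in d]
    other_docs = [d for d in docs if gene not in d]
    concepts = set().union(*gene_docs) - {gene}
    profile = {}
    for concept in concepts:
        k11 = sum(concept in d for d in gene_docs)
        k12 = len(gene_docs) - k11
        k21 = sum(concept in d for d in other_docs)
        k22 = len(other_docs) - k21
        profile[concept] = log_likelihood_ratio(k11, k12, k21, k22)
    return profile


# Hypothetical toy corpus: each document is the set of concepts it mentions.
docs = [
    {"gene_x", "concept_a"},
    {"gene_x", "concept_a"},
    {"gene_y", "concept_b"},
    {"gene_y", "concept_b"},
]
profile = concept_profile("gene_x", docs)
```

Note that G-squared measures the strength of an association but not its direction, so a real weighting scheme would need to treat concepts that co-occur less often than expected separately. Profiles built this way are sparse vectors and could be passed to any standard clustering routine.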
Integrated Genome-Scale Prediction of Detrimental Mutations in Transcription Networks
A central challenge in genetics is to understand when and why mutations alter the phenotype of an organism. The consequences of gene inhibition have been systematically studied and can be predicted reasonably well across a genome. However, many sequence variants important for disease and evolution may alter gene regulation rather than gene function. The consequences of altering a regulatory interaction (or “edge”) rather than a gene (or “node”) in a network have not been as extensively studied. Here we use an integrative analysis and evolutionary conservation to identify features that predict when the loss of a regulatory interaction is detrimental in the extensively mapped transcription network of budding yeast. Properties such as the strength of an interaction, location and context in a promoter, regulator and target gene importance, and the potential for compensation (redundancy) associate to some extent with interaction importance. Combined, however, these features predict quite well whether the loss of a regulatory interaction is detrimental across many promoters and for many different transcription factors. Thus, despite the potential for regulatory diversity, common principles can be used to understand and predict when changes in regulation are most harmful to an organism.